Automatic Web Data Extraction Based on Genetic Algorithms and Regular Expressions
Authors
Abstract
Data extraction from the World Wide Web is a well-known, unsolved, and critical problem in the design of complex information systems. It concerns the extraction, management, and reuse of the huge amount of Web data available. These data are usually highly heterogeneous, volatile, and of low quality (i.e., they contain format and content mistakes), so it is quite hard to build reliable systems. In this chapter we present an updated review of the state of the art in Web data extraction, and an Evolutionary Computation approach, based on Genetic Algorithms and Regular Expressions, to the problem of automatically learning software entities. These entities, also called wrappers, will be able to extract certain kinds of Web data structures from examples.
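The general idea of evolving a regular expression from examples can be illustrated with a minimal sketch. This is not the chapter's algorithm: the atom set `ATOMS`, the length cap `MAX_LEN`, the fitness function, the genetic operators, and all parameter values are assumptions chosen only to keep the example small. Each individual is a list of regex fragments, and its fitness is the number of positive examples it matches plus the number of negative examples it rejects.

```python
import random
import re

# Hypothetical building blocks; any concatenation of them is a valid regex.
ATOMS = [r"\d", r"\d+", r"[a-z]+", r"[A-Z]", r"\s", "-", "/", ":"]
MAX_LEN = 8  # assumed cap on genome length to keep patterns small


def fitness(genome, positives, negatives):
    """Count positives fully matched plus negatives rejected."""
    pattern = re.compile("".join(genome))
    hits = sum(1 for s in positives if pattern.fullmatch(s))
    rejections = sum(1 for s in negatives if not pattern.fullmatch(s))
    return hits + rejections


def crossover(a, b, rng):
    """One-point crossover: a prefix of one parent plus a suffix of the other."""
    child = a[: rng.randrange(len(a) + 1)] + b[rng.randrange(len(b) + 1):]
    return child[:MAX_LEN] or [rng.choice(ATOMS)]


def mutate(genome, rng):
    """Replace, insert, or delete a single atom."""
    child = list(genome)
    op = rng.choice(("replace", "insert", "delete"))
    if op == "replace":
        child[rng.randrange(len(child))] = rng.choice(ATOMS)
    elif op == "insert" and len(child) < MAX_LEN:
        child.insert(rng.randrange(len(child) + 1), rng.choice(ATOMS))
    elif op == "delete" and len(child) > 1:
        del child[rng.randrange(len(child))]
    return child


def evolve(positives, negatives, generations=50, size=40, seed=0):
    """Return the best pattern string found by a simple elitist GA."""
    rng = random.Random(seed)

    def score(genome):
        return fitness(genome, positives, negatives)

    population = [
        [rng.choice(ATOMS) for _ in range(rng.randint(1, 5))]
        for _ in range(size)
    ]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        elite = population[: size // 4]  # the best quarter survives unchanged
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
            for _ in range(size - len(elite))
        ]
        population = elite + children
    return "".join(max(population, key=score))


if __name__ == "__main__":
    dates = ["2021-03-04", "1999-12-31"]
    others = ["hello", "12:30", "2021/03/04"]
    print(evolve(dates, others))
```

Because the elite is carried over unchanged, the best fitness never decreases between generations. A realistic wrapper learner would also need a richer pattern language and would have to extract fields from inside HTML pages rather than match whole strings.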
Similar resources
Automatic Wrappers for Large Scale Web Extraction
We present a generic framework to make wrapper induction algorithms tolerant to noise in the training data. This enables us to learn wrappers in a completely unsupervised manner from automatically and cheaply obtained noisy training data, e.g., using dictionaries and regular expressions. By removing the site-level supervision that wrapper-based techniques require, we are able to perform informa...
Theory and Algorithms for Information Extraction and Classification in Textual Data Mining
Regular expressions can be used as patterns to extract features from semi-structured and narrative text [8]. For example, in police reports a suspect’s height might be recorded as “{CD} feet {CD} inches tall”, where {CD} is the part of speech tag for a numeric value. The result in [1] shows us that regular expressions could have higher performance than explicit expressions in some applications ...
Computational Aspects of Resilient Data Extraction
Automatic data extraction from semistructured sources such as HTML pages is rapidly growing into a problem of significant importance, spurred by the growing popularity of the so-called "shopbots" that enable end users to compare prices of goods and other services at various web sites without having to manually browse and fill out forms at each one of these sites. The main problem one has to con...
A New Hybrid model of Multi-layer Perceptron Artificial Neural Network and Genetic Algorithms in Web Design Management Based on CMS
The size and complexity of websites have grown significantly in recent years. In line with this growth, the need to maintain most of the resources has intensified. Content Management Systems (CMSs) are software introduced in response to the increased demands of users. With the advent of Content Management Systems, factors such as: domains, predesigned module’s development, grap...
IWrap: Instant Web Wrapper Generator
In this paper, we describe an automatic Web wrapper generator that creates specification files, which contain the schema information and extraction rules for a class of Web pages. These specification files can then be used by a wrapper engine (e.g. MIT COIN Grenouille) to extract information from semi-structured Web sites. We create specification files through a WYSIWYG GUI with minimal user i...